1. Identity statement
Reference Type: Conference Paper (Conference Proceedings)
Site: sibgrapi.sid.inpe.br
Holder Code: ibi 8JMKD3MGPEW34M/46T9EHH
Identifier: 8JMKD3MGPEW34M/45BRTJ8
Repository: sid.inpe.br/sibgrapi/2021/08.31.02.34
Last Update: 2021:08.31.02.34.11 (UTC) administrator
Metadata Repository: sid.inpe.br/sibgrapi/2021/08.31.02.34.11
Metadata Last Update: 2022:06.14.00.00.17 (UTC) administrator
DOI: 10.1109/SIBGRAPI54419.2021.00048
Citation Key: Correa:2021:CoOpCh
Title: Combination of Optical Character Recognition Engines for Documents Containing Sparse Text and Alphanumeric Codes
Format: On-line
Year: 2021
Access Date: 2024, May 06
Number of Files: 1
Size: 222 KiB
2. Context
Author: Correa, Iago Lourenço
Affiliation: Federal University of Rio Grande (FURG)
Editor: Paiva, Afonso
  Menotti, David
  Baranoski, Gladimir V. G.
  Proença, Hugo Pedro
  Junior, Antonio Lopes Apolinario
  Papa, João Paulo
  Pagliosa, Paulo
  dos Santos, Thiago Oliveira
  e Sá, Asla Medeiros
  da Silveira, Thiago Lopes Trugillo
  Brazil, Emilio Vital
  Ponti, Moacir A.
  Fernandes, Leandro A. F.
  Avila, Sandra
e-Mail Address: iago.correa@outlook.com
Conference Name: Conference on Graphics, Patterns and Images, 34 (SIBGRAPI)
Conference Location: Gramado, RS, Brazil (virtual)
Date: 18-22 Oct. 2021
Publisher: IEEE Computer Society
Publisher City: Los Alamitos
Book Title: Proceedings
Tertiary Type: Full Paper
History (UTC): 2021-08-31 02:34:11 :: iago.correa@outlook.com -> administrator ::
  2022-03-02 00:54:15 :: administrator -> menottid@gmail.com :: 2021
  2022-03-02 13:23:54 :: menottid@gmail.com -> administrator :: 2021
  2022-06-14 00:00:17 :: administrator -> :: 2021
3. Content and structure
Is the master or a copy? is the master
Content Stage: completed
Transferable: 1
Version Type: finaldraft
Keywords: optical character recognition
  classifier combination
  pattern recognition
  tesseract
  median string
Abstract: Many companies that buy machines, parts, or tools retain documents such as notes, receipts, forms, or instruction manuals over the years, and they may find themselves in need of digitizing these accumulated documents. Thus, when using optical character recognition (OCR) systems in these documents, it is possible to note that these systems can present two main difficulties. The first is to locate the sparse text in a non-continuous way, and the second is to match words that are closer to codes and less to words in human language. Although there are many works in the literature about sparse texts, such as forms and tables, there is usually not much concern about the issue with codes in which one cannot rely on dictionaries or even both problems together. Therefore, to correct this issue without having to search for extensive databases or conduct training and development of new models, this work proposed to take advantage of pre-trained models of OCR such as from the Tesseract engine or the Google Cloud's Vision API. In order to do so, we proposed the exploration of combination strategies, including a new one based on median string. The experimental results achieved up to 3.09% improvement in character accuracy and 1.16% in word accuracy in comparison to the best individual performances from the engines when our method based on string combination was adopted.
Arrangement 1: urlib.net > SDLA > Fonds > SIBGRAPI 2021 > Combination of Optical...
Arrangement 2: urlib.net > SDLA > Fonds > Full Index > Combination of Optical...
doc Directory Content: access
source Directory Content: there are no files
agreement Directory Content:
  agreement.html 30/08/2021 23:34 1.3 KiB
4. Conditions of access and use
data URL: http://urlib.net/ibi/8JMKD3MGPEW34M/45BRTJ8
zipped data URL: http://urlib.net/zip/8JMKD3MGPEW34M/45BRTJ8
Language: en
Target File: Paper ID 28.pdf
User Group: iago.correa@outlook.com
Visibility: shown
Update Permission: not transferred
5. Allied materials
Mirror Repository: sid.inpe.br/banon/2001/03.30.15.38.24
Next Higher Units: 8JMKD3MGPEW34M/45PQ3RS
  8JMKD3MGPEW34M/4742MCS
Citing Item List: sid.inpe.br/sibgrapi/2021/11.12.11.46 3
Host Collection: sid.inpe.br/banon/2001/03.30.15.38
6. Notes
Empty Fields: archivingpolicy archivist area callnumber contenttype copyholder copyright creatorhistory descriptionlevel dissemination edition electronicmailaddress group isbn issn label lineage mark nextedition notes numberofvolumes orcid organization pages parameterlist parentrepositories previousedition previouslowerunit progress project readergroup readpermission resumeid rightsholder schedulinginformation secondarydate secondarykey secondarymark secondarytype serieseditor session shorttitle sponsor subject tertiarymark type url volume

